[W8A8 Block Linear Refactor][2/N] Remove W8A8Fp8BlockLinearOp and adopt Fp8 block linear kernel selections. (#33892)
Conversation
Signed-off-by: maral <maralbahari.98@gmail.com>
Code Review
This pull request introduces a significant and well-designed refactoring of the FP8 block-scaled linear kernel integration. By removing the monolithic `W8A8BlockFp8LinearOp` and introducing a new kernel abstraction layer with `MMLinearKernel`, the code becomes much more modular, maintainable, and extensible. The new kernel selection mechanism in `init_fp8_linear_kernel` is clear and correctly dispatches to different kernel implementations based on the quantization configuration. The changes are consistently applied across benchmarks, tests, and model implementation files.
I've found a few issues, including a critical one that would cause a runtime error, and a couple of high-severity issues related to correctness in tests and code robustness. After addressing these, this PR will be a great improvement to the codebase.
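To make the selection mechanism concrete, here is a minimal sketch of a config-driven kernel dispatch function. The class names, config fields, and selection criterion below are illustrative assumptions for this sketch, not the PR's actual code:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the PR's block-scaled FP8 kernel classes.
class CutlassBlockFp8Kernel: ...
class TritonBlockFp8Kernel: ...

@dataclass
class QuantConfig:
    # e.g. (128, 128) for 128x128 block-scaled FP8 weights
    weight_block_size: tuple
    use_cutlass: bool = True

def init_fp8_linear_kernel(cfg: QuantConfig):
    """Pick a block-scaled FP8 kernel from the quantization config:
    prefer the fast backend when enabled, else a portable fallback."""
    if cfg.use_cutlass:
        return CutlassBlockFp8Kernel
    return TritonBlockFp8Kernel

print(init_fp8_linear_kernel(QuantConfig((128, 128))).__name__)
# -> CutlassBlockFp8Kernel
print(init_fp8_linear_kernel(QuantConfig((128, 128), use_cutlass=False)).__name__)
# -> TritonBlockFp8Kernel
```

The point of centralizing the decision in one function is that callers never hard-code a backend; they hand over the quantization config and receive whichever kernel class is valid for it.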
Signed-off-by: maral <maralbahari.98@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
…r.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
…kScaledMMLinearKernel.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
…block-scaled-rfc-pr Signed-off-by: maral <maralbahari.98@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
…ement for cutlass and fix type error in dynamic deepgemm/flash-infer Signed-off-by: maral <maralbahari.98@gmail.com>
…block-scaled-rfc-pr
Signed-off-by: maral <maralbahari.98@gmail.com>
Hi @maralbahari, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
Signed-off-by: maral <maralbahari.98@gmail.com>
Signed-off-by: maral <maralbahari.98@gmail.com>
…pt Fp8 block linear kernel selections. (vllm-project#33892) Signed-off-by: maral <maralbahari.98@gmail.com> Signed-off-by: Maral <maralbahari.98@gmail.com>
…pt Fp8 block linear kernel selections. (vllm-project#33892) Signed-off-by: maral <maralbahari.98@gmail.com> Signed-off-by: Maral <maralbahari.98@gmail.com> Signed-off-by: jackcfwang <jackcfwang@tencent.com>
Required by NvFp4LinearKernel refactor (vllm-project#39129). Copied from upstream/main rather than cherry-picking the full W8A8 block linear refactor (vllm-project#33892, 35 files). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…selections (vllm-project#33892) Cherry-picked from upstream vllm-project/vllm@2e9034c99. Required dependency for NvFp4LinearKernel refactor (vllm-project#39129) — provides base.py, block-scaled kernel classes, and updated FP8 utils. Also synced nvfp4_emulation_utils.py for kE2M1ToFloat_handle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous cherry-pick of vllm-project#33892 overwrote NVFP4 exports from vllm-project#39129. Synced to upstream/main which has both FP8 block and NVFP4 kernel exports. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stream regressions in attention, FP8, offloading and platform (#1338)

## Summary

Fixes five regressions introduced by recent upstream vLLM changes that break HPU unit tests and model execution.

## Changes

1. **Remove `use_output` guard from HPU attention patch** — attribute removed upstream
2. **Remove `accept_output_buffer` branching from HPU MLA attention** — attribute removed upstream; unconditionally use output buffer in opaque path, direct call path manages output internally
3. **Update KV offloading connector tests** — field renames: `block_hashes` → `keys`, `block_hashes_to_store` → `keys_to_store`, config access via `kv_group_configs[0]`
4. **Register HPU FP8 block-scaled kernel + add ops test conftest** — new `_POSSIBLE_FP8_BLOCK_KERNELS` dict needs OOT entry; provide `VllmConfig` stub for ops unit tests
5. **Add `manual_seed_all` to `HpuPlatform`** — new required platform method for RNG seeding

## Upstream PRs that introduced these regressions

- vllm-project/vllm#39125 — removed `accept_output_buffer` and `use_output` from attention layer (fixes 1, 2)
- vllm-project/vllm#37109 — restructured `OffloadingConnectorScheduler` API (fix 3)
- vllm-project/vllm#33892 — added `model_config.dtype` access in `Fp8LinearMethod.__init__` and `_POSSIBLE_FP8_BLOCK_KERNELS` (fix 4)
- vllm-project/vllm#38468 — added `manual_seed_all` as required abstract method on `Platform` (fix 5)

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Purpose

This PR refactors the block-scaled linear kernel into a kernel abstraction.

Changes:
- `MMLinearKernel`: base interface for all linear kernels.
- `Params`, `Fp8Params` and `Int8Params`: classes to access layer params in a structured format.
- `DynamicMMLinearKernel`: a type of `MMLinearKernel` with two main properties, a base kernel and a fallback kernel, both variants of `MMLinearKernel`; this class switches between the base and fallback implementations at runtime.
- Removes the `W8A8BlockFp8LinearOp` class.

Test Plan

CUDA platform:
run CI/CD tests.
ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block
Test Result
ROCm platform:
lm_eval score for RedHatAI/Qwen3-30B-A3B-FP8-block, without AITER
W8A8 Block Linear Refactor PRs:
- #33047: Moves all the quantization ops into the same `QuantFP8` class. (merged)
- This PR: Removes the `W8A8Fp8BlockLinearOp` class and updates all code paths and files that use this class.
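The runtime base/fallback switching that `DynamicMMLinearKernel` introduces can be sketched as below. The method names, the exception-based handoff, and the toy shape restriction are illustrative assumptions for this sketch, not the PR's actual implementation:

```python
class MMLinearKernel:
    """Minimal stand-in for the base kernel interface from the PR."""
    def apply(self, x):
        raise NotImplementedError

class ShapeLimitedKernel(MMLinearKernel):
    """Toy 'base' kernel that only supports batch sizes divisible by 8."""
    def apply(self, x):
        if len(x) % 8 != 0:
            raise NotImplementedError("unsupported batch size")
        return [v * 2 for v in x]

class FallbackKernel(MMLinearKernel):
    """Portable kernel that handles any input."""
    def apply(self, x):
        return [v * 2 for v in x]

class DynamicMMLinearKernel(MMLinearKernel):
    """Holds a base and a fallback kernel (both MMLinearKernel variants)
    and switches between them at runtime when the base cannot run."""
    def __init__(self, base, fallback):
        self.base = base
        self.fallback = fallback

    def apply(self, x):
        try:
            return self.base.apply(x)
        except NotImplementedError:
            return self.fallback.apply(x)

k = DynamicMMLinearKernel(ShapeLimitedKernel(), FallbackKernel())
print(k.apply([1.0] * 8))  # handled by the base kernel
print(k.apply([1.0] * 3))  # base refuses, fallback takes over
```

Both calls return the same numeric result; the wrapper only decides which implementation computes it, which is exactly the property that makes the fallback transparent to callers.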